Customer Segmentation with K-Means:Part3¶

In [70]:
from IPython.display import Image
Image(filename="D:\IBM_Data_Science\Image\Articles-Customer-Segmentation_Eng_A.png",width=700,height=300)
Out[70]:

Table of contents

  1. Customer Segmentation FAQ
  2. Import Libraries And Explore Data Analysis
  3. RMF_Technique
  4. Creating_A_Score
  5. Average Purchase Value (APV)
  6. Evaluation


1. Customer Segmentation FAQ

What does customer segmentation mean?

Customer segmentation is the process of dividing a customer base into distinct groups of individuals that have similar characteristics. This process makes it easier to target specific groups of customers with tailored products, services, and marketing strategies. By segmenting customers into different classes, businesses can better understand their needs, preferences, and buying patterns, allowing them to create more personalized and effective marketing campaigns.

What are the 4 types of customer segmentation?

  • Demographic Segmentation: This type of segmentation divides customers into different groups based on shared characteristics such as age, gender, income, occupation, education level, marital status and location. -Psychographic Segmentation: This type of segmentation divides customers into different groups based on their lifestyle, interests, values and attitudes.
  • Behavioral Segmentation: This type of segmentation divides customers into different groups based on their purchase history, usage patterns, brand loyalty and response to marketing campaigns.
  • Geographic Segmentation: This type of segmentation divides customers into different groups based on location, such as country, region, city or neighborhood. What is customer segmentation example?

Customer segmentation is the practice of dividing customers into distinct groups with common characteristics. Examples of customer segmentation include geographic segmentation (dividing customers by region), demographic segmentation (dividing customers by age, gender, marital status, etc.), behavioral segmentation (dividing customers by purchase behavior, usage patterns, loyalty, etc.), and psychographic segmentation (dividing customers by attitudes, values, lifestyle, etc.).

In the second part of the project, we will use the clustering technique known as K-Means, a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based on their similarity, and it is often used in marketing and customer segmentation The correct implementation of the algorithm can lead to clearer and more successful business decisions All of the modules and library used for this part of the project are listed below

Customer Segmentation with K-Means

In the second part of the project, we will use the clustering technique known as K-Means, a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based on their similarity, and it is often used in marketing and customer segmentation.

The correct implementation of the algorithm can lead to clearer and more successful business decisions.

All of the modules and library used for this part of the project are listed below.

2. Import Libraries And Explore Data Analysis

2.1. ImportLiraries

In [71]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import plotly.express as px
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore")
  • In the first of the project, after all the data was treated and cleaned, from the final dataframe we created the a file called clustering.csv that is going to be used to build our dataframe and perform the customer segmentation
In [72]:
## Import Dataset
df = pd.read_csv("D:\IBM_Data_Science\Data_set\Data_Kagle_Github\clustering.csv")
print(f"Number of clients: {df.shape[0]}\nNumber of features: {df.shape[1]}")
Number of clients: 2237
Number of features: 29

2.2 Explore Data Analysis

2.2.1 Let's display the columns ang the head of the DataFrame

In [73]:
pd.set_option("display.max_columns",29)
df.head()
Out[73]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Total_Purchases Age
0 5524 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1 1617 66
1 2174 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 0 27 69
2 4141 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 0 776 58
3 6182 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 0 53 39
4 5324 1981 PhD Married 58293.0 1 0 2014-01-19 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 0 422 42
  • In order to perform proper customer segmentation, we will use the RMF technique, which will be explained further along the project. For that we need the customers' frequency, which is the number of purchases made in a given period of time.

  • We don't have that value explicitally, but we can calculate it but summing the number of purhases made by our customers through different channels (Deals, Web, etc) and assign it to df['Frequency'].

2.2.2 Descriptive Statistic Analysis

In [74]:
df.describe().transpose()
Out[74]:
count mean std min 25% 50% 75% max
ID 2237.0 5587.024139 3245.458565 0.0 2826.0 5454.0 8420.0 11191.0
Year_Birth 2237.0 1968.866786 11.793625 1899.0 1959.0 1970.0 1977.0 1996.0
Income 2237.0 51973.077747 21417.001805 1730.0 35523.0 51717.0 68281.0 162397.0
Kidhome 2237.0 0.443898 0.538421 0.0 0.0 0.0 1.0 2.0
Teenhome 2237.0 0.506482 0.544593 0.0 0.0 0.0 1.0 2.0
Recency 2237.0 49.110416 28.952136 0.0 24.0 49.0 74.0 99.0
MntWines 2237.0 304.329012 336.651531 0.0 24.0 174.0 505.0 1493.0
MntFruits 2237.0 26.328565 39.793044 0.0 1.0 8.0 33.0 199.0
MntMeatProducts 2237.0 167.160036 225.793687 0.0 16.0 67.0 232.0 1725.0
MntFishProducts 2237.0 37.565937 54.654409 0.0 3.0 12.0 50.0 259.0
MntSweetProducts 2237.0 27.097005 41.297645 0.0 1.0 8.0 33.0 263.0
MntGoldProds 2237.0 44.063478 52.188898 0.0 9.0 24.0 56.0 362.0
NumDealsPurchases 2237.0 2.325436 1.932803 0.0 1.0 2.0 3.0 15.0
NumWebPurchases 2237.0 4.087617 2.779366 0.0 2.0 4.0 6.0 27.0
NumCatalogPurchases 2237.0 2.664730 2.924095 0.0 0.0 2.0 4.0 28.0
NumStorePurchases 2237.0 5.794814 3.250624 0.0 3.0 5.0 8.0 13.0
NumWebVisitsMonth 2237.0 5.316942 2.428060 0.0 3.0 6.0 7.0 20.0
AcceptedCmp3 2237.0 0.072865 0.259974 0.0 0.0 0.0 0.0 1.0
AcceptedCmp4 2237.0 0.074654 0.262890 0.0 0.0 0.0 0.0 1.0
AcceptedCmp5 2237.0 0.072865 0.259974 0.0 0.0 0.0 0.0 1.0
AcceptedCmp1 2237.0 0.064372 0.245469 0.0 0.0 0.0 0.0 1.0
AcceptedCmp2 2237.0 0.013411 0.115052 0.0 0.0 0.0 0.0 1.0
Complain 2237.0 0.008941 0.094152 0.0 0.0 0.0 0.0 1.0
Response 2237.0 0.149307 0.356471 0.0 0.0 0.0 0.0 1.0
Total_Purchases 2237.0 606.544032 602.307901 5.0 69.0 397.0 1047.0 2525.0
Age 2237.0 54.101922 11.700664 27.0 46.0 53.0 64.0 83.0

Check information In DataFrame

In [75]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2237 entries, 0 to 2236
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2237 non-null   int64  
 1   Year_Birth           2237 non-null   int64  
 2   Education            2237 non-null   object 
 3   Marital_Status       2237 non-null   object 
 4   Income               2237 non-null   float64
 5   Kidhome              2237 non-null   int64  
 6   Teenhome             2237 non-null   int64  
 7   Dt_Customer          2237 non-null   object 
 8   Recency              2237 non-null   int64  
 9   MntWines             2237 non-null   int64  
 10  MntFruits            2237 non-null   int64  
 11  MntMeatProducts      2237 non-null   int64  
 12  MntFishProducts      2237 non-null   int64  
 13  MntSweetProducts     2237 non-null   int64  
 14  MntGoldProds         2237 non-null   int64  
 15  NumDealsPurchases    2237 non-null   int64  
 16  NumWebPurchases      2237 non-null   int64  
 17  NumCatalogPurchases  2237 non-null   int64  
 18  NumStorePurchases    2237 non-null   int64  
 19  NumWebVisitsMonth    2237 non-null   int64  
 20  AcceptedCmp3         2237 non-null   int64  
 21  AcceptedCmp4         2237 non-null   int64  
 22  AcceptedCmp5         2237 non-null   int64  
 23  AcceptedCmp1         2237 non-null   int64  
 24  AcceptedCmp2         2237 non-null   int64  
 25  Complain             2237 non-null   int64  
 26  Response             2237 non-null   int64  
 27  Total_Purchases      2237 non-null   int64  
 28  Age                  2237 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 506.9+ KB
In [76]:
# Count of null value
df.isnull().sum()
Out[76]:
ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
Total_Purchases        0
Age                    0
dtype: int64
  • All columns of DataFrame with no values are null
In [77]:
# Count row and column
df.shape
Out[77]:
(2237, 29)

2.2.3 DataVisualization

In [78]:
## Check the interaction between variables
## Covariance matrix
hm=df.corr()
hm
Out[78]:
ID Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Total_Purchases Age
ID 1.000000e+00 0.006260 -8.012841e-07 0.002160 -0.002481 -0.046002 -0.021528 0.005420 -0.003341 -0.023630 -0.006655 -0.012417 -0.036954 -0.017685 -0.002289 -0.013269 -0.007159 -0.035636 -0.024965 -0.007082 -0.021223 -0.014892 0.031625 -0.021330 -0.016604 -0.002652
Year_Birth 6.260469e-03 1.000000 -2.011147e-01 0.234200 -0.357774 -0.018042 -0.165010 -0.021361 -0.035101 -0.045233 -0.021592 -0.065954 -0.065923 -0.152577 -0.127285 -0.136666 0.121202 0.061368 -0.063000 0.005793 -0.007387 -0.007252 -0.004163 0.019520 -0.118099 -0.992095
Income -8.012841e-07 -0.201115 1.000000e+00 -0.510538 0.033721 0.007618 0.686263 0.505592 0.684526 0.518786 0.518396 0.384645 -0.107538 0.450341 0.693913 0.627274 -0.646814 -0.015325 0.217420 0.395311 0.324942 0.103900 -0.027948 0.160671 0.789544 0.198770
Kidhome 2.159591e-03 0.234200 -5.105382e-01 1.000000 -0.034983 0.008129 -0.496385 -0.372747 -0.437180 -0.387666 -0.370693 -0.349750 0.221685 -0.361915 -0.502529 -0.499878 0.447897 0.014840 -0.161553 -0.205618 -0.172309 -0.081704 0.036366 -0.079835 -0.556864 -0.233855
Teenhome -2.481094e-03 -0.357774 3.372061e-02 -0.034983 1.000000 0.016931 0.004487 -0.176969 -0.261669 -0.204576 -0.162838 -0.021824 0.388462 0.155632 -0.110972 0.050394 0.135255 -0.042823 0.038790 -0.191288 -0.140288 -0.015664 0.007591 -0.154730 -0.138899 0.363105
Recency -4.600180e-02 -0.018042 7.618002e-03 0.008129 0.016931 1.000000 0.016042 -0.004309 0.023074 0.001068 0.022610 0.016423 -0.000499 -0.010769 0.024986 0.000882 -0.021473 -0.033036 0.018836 0.000119 -0.019313 -0.001787 0.005380 -0.198679 0.020402 0.019397
MntWines -2.152827e-02 -0.165010 6.862631e-01 -0.496385 0.004487 0.016042 1.000000 0.389327 0.562324 0.399392 0.386157 0.387116 0.010756 0.541899 0.634960 0.641681 -0.320997 0.061909 0.373165 0.472551 0.354024 0.205878 -0.036016 0.246920 0.891724 0.162739
MntFruits 5.420492e-03 -0.021361 5.055923e-01 -0.372747 -0.176969 -0.004309 0.389327 1.000000 0.542912 0.594663 0.567000 0.392760 -0.132427 0.296363 0.487687 0.461467 -0.418645 0.014545 0.009967 0.215697 0.194618 -0.009852 -0.002933 0.125048 0.614073 0.013618
MntMeatProducts -3.340869e-03 -0.035101 6.845257e-01 -0.437180 -0.261669 0.023074 0.562324 0.542912 1.000000 0.568188 0.523576 0.350258 -0.122682 0.293276 0.723660 0.479189 -0.539824 0.018018 0.102687 0.373649 0.309631 0.042939 -0.020810 0.236050 0.842860 0.030701
MntFishProducts -2.362977e-02 -0.045233 5.187856e-01 -0.387666 -0.204576 0.001068 0.399392 0.594663 0.568188 1.000000 0.579681 0.422630 -0.139575 0.293309 0.534257 0.459514 -0.446233 0.000150 0.016637 0.199421 0.260633 0.002491 -0.018887 0.111056 0.642633 0.042341
MntSweetProducts -6.655384e-03 -0.021592 5.183958e-01 -0.370693 -0.162838 0.022610 0.386157 0.567000 0.523576 0.579681 1.000000 0.369414 -0.120304 0.348162 0.490641 0.448341 -0.423551 0.001299 0.028415 0.259439 0.241674 0.009892 -0.020467 0.117070 0.602752 0.019372
MntGoldProds -1.241657e-02 -0.065954 3.846450e-01 -0.349750 -0.021824 0.016423 0.387116 0.392760 0.350258 0.422630 0.369414 1.000000 0.048995 0.421499 0.437378 0.381226 -0.250945 0.122906 0.022048 0.180854 0.166238 0.049911 -0.029969 0.139569 0.523954 0.057253
NumDealsPurchases -3.695428e-02 -0.065923 -1.075381e-01 0.221685 0.388462 -0.000499 0.010756 -0.132427 -0.122682 -0.139575 -0.120304 0.048995 1.000000 0.233956 -0.008857 0.068576 0.347480 -0.023182 0.015537 -0.183390 -0.123356 -0.037736 0.003665 0.002145 -0.065397 0.068281
NumWebPurchases -1.768482e-02 -0.152577 4.503415e-01 -0.361915 0.155632 -0.010769 0.541899 0.296363 0.293276 0.293309 0.348162 0.421499 0.233956 1.000000 0.377923 0.502200 -0.056139 0.041914 0.155693 0.138470 0.154954 0.034086 -0.013249 0.148390 0.519420 0.153874
NumCatalogPurchases -2.288528e-03 -0.127285 6.939128e-01 -0.502529 -0.110972 0.024986 0.634960 0.487687 0.723660 0.534257 0.490641 0.437378 -0.008857 0.377923 1.000000 0.518320 -0.520763 0.104513 0.139041 0.322188 0.307972 0.099780 -0.018348 0.220526 0.778427 0.125244
NumStorePurchases -1.326865e-02 -0.136666 6.272741e-01 -0.499878 0.050394 0.000882 0.641681 0.461467 0.479189 0.459514 0.448341 0.381226 0.068576 0.502200 0.518320 1.000000 -0.429084 -0.068033 0.179123 0.214568 0.183025 0.085090 -0.011539 0.038801 0.674256 0.139230
NumWebVisitsMonth -7.158723e-03 0.121202 -6.468143e-01 0.447897 0.135255 -0.021473 -0.320997 -0.418645 -0.539824 -0.446233 -0.423551 -0.250945 0.347480 -0.056139 -0.520763 -0.429084 1.000000 0.061171 -0.032180 -0.278200 -0.192573 -0.007217 0.020857 -0.004061 -0.500722 -0.117408
AcceptedCmp3 -3.563572e-02 0.061368 -1.532489e-02 0.014840 -0.042823 -0.033036 0.061909 0.014545 0.018018 0.000150 0.001299 0.122906 -0.023182 0.041914 0.104513 -0.068033 0.061171 1.000000 -0.079627 0.080219 0.094661 0.071981 0.009916 0.254144 0.053071 -0.061105
AcceptedCmp4 -2.496497e-02 -0.063000 2.174199e-01 -0.161553 0.038790 0.018836 0.373165 0.009967 0.102687 0.016637 0.028415 0.022048 0.015537 0.155693 0.139041 0.179123 -0.032180 -0.079627 1.000000 0.306452 0.251225 0.292184 -0.026978 0.176890 0.253097 0.064261
AcceptedCmp5 -7.081531e-03 0.005793 3.953115e-01 -0.205618 -0.191288 0.000119 0.472551 0.215697 0.373649 0.199421 0.259439 0.180854 -0.183390 0.138470 0.322188 0.214568 -0.278200 0.080219 0.306452 1.000000 0.403019 0.221503 -0.008356 0.326532 0.470005 -0.015381
AcceptedCmp1 -2.122259e-02 -0.007387 3.249420e-01 -0.172309 -0.140288 -0.019313 0.354024 0.194618 0.309631 0.260633 0.241674 0.166238 -0.123356 0.154954 0.307972 0.183025 -0.192573 0.094661 0.251225 0.403019 1.000000 0.175283 -0.024913 0.293882 0.381434 0.008147
AcceptedCmp2 -1.489225e-02 -0.007252 1.039002e-01 -0.081704 -0.015664 -0.001787 0.205878 -0.009852 0.042939 0.002491 0.009892 0.049911 -0.037736 0.034086 0.099780 0.085090 -0.007217 0.071981 0.292184 0.221503 0.175283 1.000000 -0.011074 0.169249 0.135748 0.007622
Complain 3.162482e-02 -0.004163 -2.794846e-02 0.036366 0.007591 0.005380 -0.036016 -0.002933 -0.020810 -0.018887 -0.020467 -0.029969 0.003665 -0.013249 -0.018348 -0.011539 0.020857 0.009916 -0.026978 -0.008356 -0.024913 -0.011074 1.000000 0.000185 -0.033840 0.004450
Response -2.133020e-02 0.019520 1.606711e-01 -0.079835 -0.154730 -0.198679 0.246920 0.125048 0.236050 0.111056 0.117070 0.139569 0.002145 0.148390 0.220526 0.038801 -0.004061 0.254144 0.176890 0.326532 0.293882 0.169249 0.000185 1.000000 0.264962 -0.018554
Total_Purchases -1.660364e-02 -0.118099 7.895444e-01 -0.556864 -0.138899 0.020402 0.891724 0.614073 0.842860 0.642633 0.602752 0.523954 -0.065397 0.519420 0.778427 0.674256 -0.500722 0.053071 0.253097 0.470005 0.381434 0.135748 -0.033840 0.264962 1.000000 0.113501
Age -2.652311e-03 -0.992095 1.987700e-01 -0.233855 0.363105 0.019397 0.162739 0.013618 0.030701 0.042341 0.019372 0.057253 0.068281 0.153874 0.125244 0.139230 -0.117408 -0.061105 0.064261 -0.015381 0.008147 0.007622 0.004450 -0.018554 0.113501 1.000000

Count number of columns ["Education"]

In [79]:
plt.figure(figsize=(12,7))
sns.countplot(x=df['Education'])
Out[79]:
<AxesSubplot:xlabel='Education', ylabel='count'>
In [80]:
plt.figure(figsize=(12,7))
sns.countplot(x=df['Marital_Status'])
Out[80]:
<AxesSubplot:xlabel='Marital_Status', ylabel='count'>

Distribution of (median_house_value")

In [81]:
plt.figure(figsize=(12,6))
sns.distplot(df['Total_Purchases'],color="blue")
plt.title("The graph shows the distribution of (Total_Purchases)")
Out[81]:
Text(0.5, 1.0, 'The graph shows the distribution of (Total_Purchases)')
In [82]:
plt.figure(figsize=(10,5))
sns.distplot(df['Year_Birth'],color='red')
plt.title("The graph shows the distribution of (Year_Birth)")
Out[82]:
Text(0.5, 1.0, 'The graph shows the distribution of (Year_Birth)')

The distribution of (mean_income)

In [83]:
plt.figure(figsize=(12,8))
sns.distplot(df['Income'],color='purple')
plt.title("The graph shows the distribution of (Income)")
Out[83]:
Text(0.5, 1.0, 'The graph shows the distribution of (Income)')

The distribution of (median_income)

In [84]:
plt.figure(figsize=(12,8))
sns.distplot(df['Age'],color='green')
plt.title("The graph shows the distribution of (Age)")
Out[84]:
Text(0.5, 1.0, 'The graph shows the distribution of (Age)')

3. RMF_Technique

3.1 Fit Frequency,Total_Purchases,Recency¶

In [85]:
df['Frequency'] = df[['NumDealsPurchases', 'NumWebPurchases',
                  'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
df[['ID','Frequency']].head()
Out[85]:
ID Frequency
0 5524 25
1 2174 6
2 4141 21
3 6182 8
4 5324 19
  • In order to perform proper customer segmentation, we will use the RMF technique, which will be explained further along the project. For that we need the customers' frequency, which is the number of purchases made in a given period of time.

  • We don't have that value explicitally, but we can calculate it but summing the number of purhases made by our customers through different channels (Deals, Web, etc) and assign it to df['Frequency'].

To perform our customer segmentation the technique to be applied is clustering with SKLearn K-Means.

It is a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based ontheir similarity, in order to make better and more accurate business decisions.

Before we proceed with the technique itself, we need to establish the following:

  • SEED , a parameter that garantees the reproducibility of our clustering
  • The fine tuning tuning of hyperparameters with RandomizedSearchCV

There are a few ways to determine an optimal number or clusters, but we will keep in mind the goal of our project: make better decisions based on customers behavior. In this case, we will define a number of cluster between two and four ( 'n_clusters': [i for i in range(2,5)] ) and let the algorithm tell us which is the optimal number. It wouldn't be viable to have 10, 20, clusters because we need to establish a limited number of groups in order to escalate our business actions.

That said, we will establish three customers features to create clusters for each of those features:

  • Recency
  • Total Purchases (Monetary)
  • Frequency Let's get to it!
  • Let's display the columns and the head of the DataFrame
In [86]:
pd.set_option('display.max_columns', 29)
df.head()
Out[86]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts ... NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Total_Purchases Age Frequency
0 5524 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635 88 546 172 88 ... 8 10 4 7 0 0 0 0 0 0 1 1617 66 25
1 2174 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11 1 6 2 1 ... 1 1 2 5 0 0 0 0 0 0 0 27 69 6
2 4141 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426 49 127 111 21 ... 8 2 10 4 0 0 0 0 0 0 0 776 58 21
3 6182 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11 4 20 10 3 ... 2 0 4 6 0 0 0 0 0 0 0 53 39 8
4 5324 1981 PhD Married 58293.0 1 0 2014-01-19 94 173 43 118 46 27 ... 5 3 6 5 0 0 0 0 0 0 0 422 42 19

5 rows × 30 columns

  • In order to perform proper customer segmentation, we will use the RMF technique, which will be explained further along the project. For that we need the customers' frequency, which is the number of purchases made in a given period of time.
  • We don't have that value explicitally, but we can calculate it but summing the number of purhases made by our customers through different channels (Deals, Web, etc) and assign it to df['Frequency']

RandomizedSearchCV: https://www.projectpro.io/recipes/find-optimal-parameters-using-randomizedsearchcv-for-regression.

In [87]:
df['Frequency'] = df[['NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
df[['ID','Frequency']].head()
Out[87]:
ID Frequency
0 5524 25
1 2174 6
2 4141 21
3 6182 8
4 5324 19

To perform our customer segmentation the technique to be applied is clustering with SKLearn K-Means.

It is a popular unsupervised machine learning algorithm used for cluster analysis. The algorithm groups similar data points into clusters based on their similarity, in order to make better and more accurate business decisions.

Before we proceed with the technique itself, we need to establish the following:

  • SEED , a parameter that garantees the reproducibility of our clustering
  • The fine tuning tuning of hyperparameters with RandomizedSearchCV

There are a few ways to determine an optimal number or clusters, but we will keep in mind the goal of our project: make better decisions based on customers behavior. In this case, we will define a number of cluster between two and four ( 'n_clusters': [i for i in range(2,5)] ) and let the algorithm tell us which is the optimal number. It wouldn't be viable to have 10, 20, clusters because we need to establish a limited number of groups in order to escalate our business actions.

That said, we will establish three customers features to create clusters for each of those features:

  • Recency
  • Total Purchases (Monetary)
  • Frequency

Let's get to it

In [88]:
SEED = 1234
np.random.seed(SEED)
kmeans = KMeans()
parameters = {'n_clusters': [i for i in range(2,5)], 'init':['k-means++', 'random'],
'n_init':[10,20,30,40,50], 'max_iter':[300,500,700,100]}
random_search = RandomizedSearchCV(kmeans, parameters, random_state=SEED)

Fit Frequency

In [89]:
random_search.fit(df[['Frequency']])
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)
Best hyperparameters:  {'n_init': 30, 'n_clusters': 4, 'max_iter': 500, 'init': 'random'}
Best score:  -1989.5993461036724

Fit Total_Purchases

In [90]:
random_search.fit(df[['Total_Purchases']])
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)
Best hyperparameters:  {'n_init': 30, 'n_clusters': 4, 'max_iter': 500, 'init': 'random'}
Best score:  -9340371.11741151

Fit Recency

In [91]:
random_search.fit(df[['Recency']])
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)
Best hyperparameters:  {'n_init': 30, 'n_clusters': 4, 'max_iter': 500, 'init': 'random'}
Best score:  -23783.01205702912
  • Now that we have all the optimal values for the hyperparameters, we will we create the columns df['Cluster_Frequency'] , df['Cluster_Purchases'] , df['Cluster_Recency'] and fit df['Frequency'] , df['Total_Purchases'] , df['Recency'] in their respective objectives with their own hyperparameters tuning.
In [92]:
kmeans1 = KMeans(n_init= 30, n_clusters = 4, max_iter = 500, init= 'random')
kmeans2 = KMeans(n_init= 30, n_clusters = 4, max_iter = 500, init= 'random')
kmeans3 = KMeans(n_init= 10, n_clusters = 4, max_iter = 300, init= 'random')
df['Cluster_Frequency'] = kmeans1.fit_predict(df[['Frequency']])
df['Cluster_Purchases'] = kmeans2.fit_predict(df[['Total_Purchases']])
df['Cluster_Recency'] = kmeans3.fit_predict(df[['Recency']])
df[['ID','Cluster_Recency', 'Cluster_Purchases', 'Cluster_Recency']].head()
Out[92]:
ID Cluster_Recency Cluster_Purchases Cluster_Recency
0 5524 3 0 3
1 2174 1 3 1
2 4141 1 2 1
3 6182 1 3 1
4 5324 2 2 2

3.3 Creating_A_Score¶

RMF, or Recency, Monetary, Frequency, is a technique used in customer relationship management and marketing to segment customers based on their purchasing behavior. The three components of the RMF technique are:

  • Recency: The length of time since the customer's last purchase
  • Monetary: The amount of money spent by the customer
  • Frequency: The number of purchases made by the customer

Now that we have the clusters of our customers assigned to each of these components, we will group them by the average value of these components, and assign a component score for each cluster, ranging from 0 to 3 . The higher the average of each cluster for Monetary and

Frequency , the higher the score, and the lower for Recency , the higher the score

Reset Index Frequency

In [93]:
frequency_group = df.groupby('Cluster_Frequency')['Frequency'].mean().sort_values(ascending=True).to_frame().reset_index()
frequency_group['index'] = frequency_group.index
frequency_group
Out[93]:
Cluster_Frequency Frequency index
0 3 6.612048 0
1 0 14.460076 1
2 1 20.518182 2
3 2 26.861027 3

Reset Index Recency

In [94]:
recency_group = df.groupby('Cluster_Recency')['Recency'].mean().sort_values(ascending=False).to_frame().reset_index()
recency_group['index'] = recency_group.index
recency_group
Out[94]:
Cluster_Recency Recency index
0 2 84.622699 0
1 3 56.898305 1
2 1 31.922780 2
3 0 9.601677 3

Reset Index Total_Purchases

In [95]:
purchases_group = df.groupby('Cluster_Purchases')['Total_Purchases'].mean().sort_values(ascending=True).to_frame().reset_index()
purchases_group['index'] = purchases_group.index
purchases_group
Out[95]:
Cluster_Purchases Total_Purchases index
0 3 100.580374 0
1 2 575.307159 1
2 1 1121.424837 2
3 0 1765.003636 3

DataFrame RFM Score

In [96]:
frequency_score = {2:0,0:1,1:2,3:3}
recency_score = {0:0, 2:1, 1:2, 3:3}
monetary_score = {2:0,1:1,3:2,0:3}
df['Frequency_Score'] = df['Cluster_Frequency'].map(frequency_score)
df['Recency_Score'] = df['Cluster_Recency'].map(recency_score)
df['Monetary_Score'] = df['Cluster_Purchases'].map(monetary_score)
df[['Frequency_Score','Recency_Score', 'Monetary_Score']].head()
Out[96]:
Frequency_Score Recency_Score Monetary_Score
0 0 3 3
1 3 2 2
2 2 2 0
3 3 2 2
4 2 1 0
  • Now that the customers have a score for each component, we will create a general score in df['RMF_Score'] , ranging from 0 to 9 . This is the parameter that will define a customer as Premium , Occasional , Loyal , Moderate or Inactive .
In [97]:
df['RMF_Score'] = df[['Frequency_Score','Recency_Score', 'Monetary_Score']].sum(axis = 1)
df[['ID','RMF_Score']]
Out[97]:
ID RMF_Score
0 5524 6
1 2174 7
2 4141 4
3 6182 7
4 5324 3
... ... ...
2232 10870 6
2233 4001 5
2234 7270 4
2235 8235 2
2236 9405 5

2237 rows × 2 columns

Let assign to each costumer a segmenation, based on their scores

  • RMF Score = 0 -> Inactive

  • RMF Score = 1 or 2 -> Occasional

  • RMF Score = 3 or 4 -> Moderate

  • RMF Score = 5 or 5 -> Loyal

  • RMF Score = 7 , 8 , or 9 -> Premium

In [98]:
df['Segmentation'] = 'Inactive'
df.loc[df['RMF_Score']>=1, 'Segmentation'] = 'Occasional'
df.loc[df['RMF_Score']>=3, 'Segmentation'] = 'Moderate'
df.loc[df['RMF_Score']>=5, 'Segmentation'] = 'Loyal'
df.loc[df['RMF_Score']>=7, 'Segmentation'] = 'Premium'
sns.countplot(df['Segmentation'])
plt.title('Number of customers by Segmentation')
plt.show()
In [99]:
df['Segmentation'].value_counts().sort_values(ascending=False).to_frame()
Out[99]:
Segmentation
Loyal 871
Moderate 625
Premium 481
Occasional 239
Inactive 21

Above, you can see the number of clients by segmentation

Let's now plot a 3 d Sctterplot showing the clusters and its componet and the three components of our score

In [101]:
fig = px.scatter_3d(df, x='Frequency', y='Recency', z = 'Total_Purchases', color='Segmentation')
fig.show()

It is important to notice that, unlike other situations in which we use clustering techniques, the clusters were not as well defined like it may happen in a recommendation system, for example, but the intelligence is useful for business decision-making.

It is clear that:

  • Inactive customers can be considere lost. They purchase very little, too long ago.
  • There's a big portion of our Moderate and Loyal Customer that hasn't made purchases in over 60 days. It's time to reactivate these clients.
  • Our Premium customers are the most frequent, even though the Monetart score of some Moderate customer is higher than theirs, they are the ones who hold the standard.
  • Moderate customers need special attention. A considerable portion of them made big purchases frequently, but have been absent. Some are te be considered lot, but they are the majority of our also our new clients that can be retained.

Below, you can see four plots that show in detail what has been stated above

In [104]:
plt.figure(figsize=(20,7))
sns.relplot(data=df,x='Frequency',y='Total_Purchases',col='Recency_Score',hue="Segmentation")
Out[104]:
<seaborn.axisgrid.FacetGrid at 0x16286412e50>
<Figure size 2000x700 with 0 Axes>

4. Amount spent on products by segmentation¶

In the Exploratory Data Anlysis we have come to find out that meat and wine are our champions and are even more correlated to the customers income than Gold.

After the clustering and segmentation, does this pattern remain? Let's see below

In [106]:
df.groupby('Segmentation')[["MntWines","MntFruits",'MntMeatProducts','MntFishProducts',
                           
                           'MntGoldProds']].sum().plot(kind="bar",figsize=(16,8))
plt.title("Total Amount Apent On Products By Segmentation")
plt.xticks(rotation=0)
plt.show()
  • As observed, the tendency remains. It is important to notice, though, that these are the absolute values. Let's analysise it relatively.
In [110]:
df.groupby('Segmentation')[['MntWines','MntFruits','MntFruits',
                         'MntMeatProducts','MntFishProducts',
                         'MntGoldProds']].mean().plot(kind='bar',figsize=(16,8))
plt.title("Relative Amount Spent On Products By Segmentation")
plt.xticks(rotation=0)
plt.show()

Premium customer have that title for a reason, don't you agree?

It's worthy of note that there is a clear pattern, that varies according to other factor, some yet to be uncovered by further analysis, but it's clear that meat and wine have the most notable perfomance in the products mix.

Let's now observe the average income by segmentation.

In [112]:
fig=plt.figure()
df.groupby("Segmentation")['Income'].mean().plot(kind='bar',figsize=(12,8),color= ['#003049','#D62828','#F77F00', '#2274A5', '#1F7A8C' ])
plt.title("Average Income By Segmentation")
plt.xticks(rotation=0)
plt.xlabel("Segmentation",labelpad=8)
plt.show()
  • And both the absolute and relative amounts in purchases by segmentation
In [115]:
df_sum=df.groupby("Segmentation")['Total_Purchases'].sum()
df_mean=df.groupby("Segmentation")['Total_Purchases'].mean()

fig=plt.figure(figsize=(24,12))

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)

df_sum.plot(kind='bar', color = ['#003049','#D62828','#F77F00', '#2274A5', '#1F7A8C' ], ax=ax0)
ax0.set_xticks([i for i in range(5)], df_sum.index, rotation=0, fontweight='bold', fontsize='large')

df_mean.plot(kind='bar', color = ['#006BA6','#0496FF','#FFBC42', '#D81159', '#2CA58D' ], ax=ax1)

ax1.set_title('Relative Amount in Purchases by Segmentation', fontweight='bold', fontsize='large')
ax1.set_xlabel('Segmentation', labelpad=8, fontweight='bold', fontsize='large')
ax1.set_xticks([i for i in range(5)], df_mean.index, rotation=0, fontweight='bold', fontsize='large')

plt.show()
  • Now let's analyse the performanece of the five campaigs
In [124]:
df_campaigns=df.groupby("Segmentation")[['AcceptedCmp3',"AcceptedCmp4",'AcceptedCmp5','AcceptedCmp1',
                                        'AcceptedCmp2']].sum().transpose().reset_index()
df_campaigns.index.rename('Index',inplace=True)
df_campaigns.rename(columns={"index":"Segmentation"},inplace=True)
fig=px.bar(df_campaigns,x='Segmentation',y=['Inactive','Loyal','Moderate','Occasional','Premium'],title="Campaign performance by Segmentation")
           
fig.show()

Now this is a topic that deserves thotough explanation. We can see that:

  • Campaigs 2 was an absolute disaster
  • Campaigs 3 had the best potential to retain Inactive and Moderate Clients
  • Campaigns 1 and 5 had similiar performance and if they had efforts dicrected to boost ticket value, they had potential to be a success
  • Campaigns 4 had to more potential to boost ticket value, but no so much to retain clients

5. Average Purchase Value (APV)¶

Average Purchase Value (APV) is the average amount spent by a customer on a single transaction( df['Total_Purchases']/df['Frequency'] ).

Let's create the column corresponding to this feature and the, group the customer bysegmentation and see the Average Purchase Value for each category

Calculate the Average Purchase Value (APV)

In [126]:
df['APV']=df['Total_Purchases']/df['Frequency']
df.groupby('Segmentation')['APV'].mean()
Out[126]:
Segmentation
Inactive      25.301070
Loyal               NaN
Moderate      39.873958
Occasional    37.470431
Premium             NaN
Name: APV, dtype: float64
  • The NaN values are, most likely, due to the existence of customers with Frequency = 0 . Let's check it below

Frequency assignment

In [127]:
df[df['Frequency']==0]
Out[127]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts ... Complain Response Total_Purchases Age Frequency Cluster_Frequency Cluster_Purchases Cluster_Recency Frequency_Score Recency_Score Monetary_Score RMF_Score Segmentation APV
653 5555 1975 Graduation Divorced 153924.0 0 0 2014-02-07 81 1 1 1 1 1 ... 0 0 6 48 0 3 3 2 3 1 2 6 Loyal inf
979 3955 1965 Graduation Divorced 4861.0 0 0 2014-06-22 20 2 1 1 1 0 ... 0 0 6 58 0 3 3 0 3 0 2 5 Loyal inf
1522 11110 1973 Graduation Single 3502.0 1 0 2013-04-13 56 2 1 1 0 0 ... 0 0 5 50 0 3 3 3 3 3 2 8 Premium inf
2130 11181 1949 PhD Married 156924.0 0 0 2013-08-29 85 2 1 2 1 1 ... 0 0 8 74 0 3 3 2 3 1 2 6 Loyal inf

4 rows × 39 columns

Yes, it seems to be the case. As you can see above, these customers have a poor relationship with the company, hence, their rows can be dropped and yet, the integrity of our analysis will be kept

In [129]:
df.drop([653,979,1552,2130],inplace=True)
df[df['Frequency']==0]
Out[129]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts ... Complain Response Total_Purchases Age Frequency Cluster_Frequency Cluster_Purchases Cluster_Recency Frequency_Score Recency_Score Monetary_Score RMF_Score Segmentation APV
1522 11110 1973 Graduation Single 3502.0 1 0 2013-04-13 56 2 1 1 0 0 ... 0 0 5 50 0 3 3 3 3 3 2 8 Premium inf

1 rows × 39 columns

Table Segmentation and Average Purchase Value (APV)

In [130]:
df.groupby("Segmentation")['APV'].mean().to_frame()
Out[130]:
APV
Segmentation
Inactive 25.301070
Loyal 31.696840
Moderate 39.879065
Occasional 37.470431
Premium NaN
  • After we've come to this value, it is logical to wonder how correlated are the Average Purchase Value ( APV ) and the Income . Let's have a visual perspective, including the amount of customers above the average income and above the average purchase value, represented by two lines
In [132]:
plt.figure(figsize=(12,8))
sns.scatterplot(data=df,x='APV',y='Income',hue='Segmentation')
plt.axhline(y=df['Income'].mean(),color='r',linestyle='--')
plt.axvline(x=df['APV'].mean(),color='r',linestyle='--')
plt.title('Income x APV')
plt.show()
  • The plot doesn't look exactly and the is because a clear outlier, flattening our plot. It's the customer with the maximum APV, way to far above the others' APV. Let's examine it closer
In [133]:
df[df['APV']==df['APV'].max()]
Out[133]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts ... Complain Response Total_Purchases Age Frequency Cluster_Frequency Cluster_Purchases Cluster_Recency Frequency_Score Recency_Score Monetary_Score RMF_Score Segmentation APV
1522 11110 1973 Graduation Single 3502.0 1 0 2013-04-13 56 2 1 1 0 0 ... 0 0 5 50 0 3 3 3 3 3 2 8 Premium inf

1 rows × 39 columns

In [ ]: